FYTH - VAST challenge

Assignment 2

_Story of Assignment 1 and visualization of bird sound files_

To-do:

  • Turn your visualizations from Assignment 1 into a report (e.g. Doc file) explaining what you have found so far and what your next steps might be to answer the mini challenge.
  • Visualize a sound file as a signal
  • Collect a list of methods to classify bird sounds from the literature, starting from e.g. _Automated bird sound recognition in realistic settings_ and its references.
  • Visualize multiple audio signals and their 2D spectrograms to facilitate visual inspection and comparison
  • Although automated classification can be applied to the Kasios sound files, find visual features that can be highlighted in the visualization to help understand which features best discriminate bird sounds. You don't need to be 100% successful, of course.
In [14]:
%matplotlib inline
import pandas as pd
#import warnings
#warnings.filterwarnings("ignore")
In [15]:
# our vast library
import vast as vst
In [16]:
from pydub import AudioSegment
In [17]:
import os
import re
In [18]:
import numpy as np
from scipy.fftpack import fft
from scipy import signal
from scipy.io import wavfile
import statistics
import librosa
import heapq
In [19]:
#visualization
from ipywidgets import IntProgress
from IPython.display import display
import matplotlib.pyplot as plt
import seaborn as sns 
In [20]:
path = "https://github.com/zhufangda/Telecom_Paristech-3rd_year/raw/master/"\
        +"DATA920_Visualization/2018%20Mini-Challenge%201/"

Results of the first assignment

In [190]:
# vectorize the map
map_contours = vst.vectorize("LekagulRoadways2018.png")
In [192]:
# read the file 
data = pd.read_csv(path + "AllBirdsv4.csv", parse_dates=[5], dayfirst=True)
In [193]:
cat_bird = pd.Categorical(data['English_name'],ordered=False)
i_bp = cat_bird.categories.tolist().index('Rose-crested Blue Pipit')
In [194]:
nb_categories = len(cat_bird.categories.unique())
In [195]:
# Add a column with 1 for the Blue pipits else 0
data['IsRCBP'] = data['English_name'].apply(lambda x: 1 \
                                            if x=="Rose-crested Blue Pipit"\
                                            else 0)
In [196]:
# clean vocalization
# all in lower string
data['Vocalization_type'] = data['Vocalization_type'].apply(str.lower)
# delete spaces
data['Vocalization_type'] = data['Vocalization_type'].apply(str.strip)
In [197]:
# clean the grids 
# returns -1 if it is not possible to select a number between 0 and 200
data['X'] = data['X'].apply(vst.clean_grid)
data['Y'] = data['Y'].apply(vst.clean_grid)
In [198]:
kasios_records = pd.read_csv(path + "Test_Birds_Location.csv")
In [199]:
kasios_records = kasios_records.rename(index=str, columns={" X": "X", " Y":"Y"})
In [200]:
def plotmap(list_kasios_bp=[]):
    fig, ax = plt.subplots(figsize=(10,10))
    # print the map
    vst.print_map(ax, map_contours, "All record locations")
    # plot Blue pipits
    ax.scatter(data.loc[data['IsRCBP']== 1]['X'], 
               data.loc[data['IsRCBP']== 1]['Y'],
               color='blue', alpha=0.7, label="Blue Pipit location")
    # plot others
    ax.scatter(data.loc[data['IsRCBP']== 0]['X'], 
               data.loc[data['IsRCBP']== 0]['Y'],
               color='green', alpha=0.2, label="Other birds location")

    # Add Kasios records with ID
    for i, txt in enumerate(kasios_records['ID'].values):
        if i in list_kasios_bp:
            ax.text(kasios_records['X'].values[i], kasios_records['Y'].values[i], 
                    txt, color='white', ha="center", va="center", 
                    bbox={'pad':0.4, 'boxstyle':'circle', 
                          'edgecolor':'none', 'facecolor':'blue'})
        else:
            ax.text(kasios_records['X'].values[i], kasios_records['Y'].values[i], 
                    txt, color='black', ha="center", va="center", 
                    bbox={'pad':0.4, 'boxstyle':'circle', 
                          'edgecolor':'none', 'facecolor':'orange'})            
    ax.scatter([], [], color='orange', marker='o',s=100, label='Kasios records')
    ax.legend(bbox_to_anchor=(1, 1), labelspacing=1)
    plt.show()

In [201]:
plotmap()
**Main conclusions of the first assignment**
We have enough records if we keep only good-quality call and song sounds.
Blue Pipits nest, or have nested, in the vicinity of the dumping site.
We now have to check whether the Kasios records are Blue Pipit records, especially records #1, #11, #6 and #15.
In [203]:
colors = ['#a6cee3','#009432','#b2df8a','#33a02c','#fb9a99','#e31a1c','#fdbf6f',
          '#ff7f00','#cab2d6','#6a3d9a','#ffff99','#b15928','#004d40','#7B1FA2',
          '#7C4DFF','#795548','#0652DD','#B53471','#FF9800','#8BC34A','#CDDC39',
          '#b71c1c','#FFC107','#607D8B']  
In [204]:
fig, ax = plt.subplots(figsize=(10,10))
vst.print_map(ax, map_contours, "All bird record locations")
elements = []

for i, categ in enumerate(cat_bird.categories.tolist()):
    X = data.loc[data['English_name'] == categ]['X'].tolist()
    Y = data.loc[data['English_name'] == categ]['Y'].tolist()
    element = ax.scatter(X,Y, color=colors[i], alpha=1, marker='*', zorder=5)
    elements.append([element])
labels = cat_bird.categories.tolist()
#ax.legend(bbox_to_anchor=(1, 1), labelspacing=1)

Sound files

Sound file preprocessing

In [205]:
len(data)
Out[205]:
2081

Remove all non-useful data

In [206]:
# keep only useful data: good quality (A-C), calls and songs
data_sounds = data.loc[data['Quality'] <= 'C']
data_sounds = data_sounds.loc[data_sounds.Vocalization_type.isin(['song', 'call', 'call, song'])]
In [207]:
len(data_sounds)
Out[207]:
1893
In [208]:
cat_bird2 = pd.Categorical(data_sounds['English_name'],ordered=False)
In [209]:
def plot_new_distribub():
    fig = plt.subplots(figsize=(12,5))
    ax = cat_bird2.value_counts().plot(kind='bar', color='#009432', 
                                      label='Other birds', zorder = 2)
    ax.get_children()[16].set_color('#0652DD')
    ax.get_children()[16].set_label('Blue Pipits')
    plt.xticks(rotation=30, ha='right')
    plt.ylabel("# of records", fontsize=12)
    ax.legend()
    ax.grid(which='major', axis='y', linestyle='--', zorder=1)
    plt.title("Distribution of the records by bird category while selecting call, songs and ABC quality", fontsize=16)
    plt.show()

In [210]:
plot_new_distribub()
We still have enough samples to perform ML (1893 samples out of 2081, including more than 170 Blue Pipits)

First visualizations

Transform the mp3 into wav

In [211]:
#transform mp3 into wav
sound = AudioSegment.from_mp3("Sounds/test3.mp3")
sound = sound.set_frame_rate(44100)
sound = sound.set_channels(1)

sound.export("Sounds/test.wav", format="wav")
Out[211]:
<_io.BufferedRandom name='Sounds/test.wav'>
In [212]:
# open the wav file 
rate, samples = wavfile.read("Sounds/test.wav")
In [213]:
print(rate) ## nb frames per second
44100
- Use wav files (uncompressed)
- Standardize the sampling rate to 44100 Hz
- Keep only one channel (stereo to mono)
In [214]:
print(samples.shape)
(5328000,)
In [215]:
n_samples = len(samples)

From stereo to mono

In [216]:
# if stereo, keep only one channel
if samples.ndim == 2:
    samples = samples[:,0]

Plot the whole sound file : magnitude and spectrogram

In [217]:
def log_specgram(audio, sample_rate, window_size=20, step_size=10, eps=1e-10):
    """
    Compute a log-scaled spectrogram with a frame length of window_size ms
    and a step of step_size ms (hence 50% overlap with the defaults).
    """
    nperseg = int(round(window_size * sample_rate / 1e3))
    noverlap = int(round(step_size * sample_rate / 1e3))
    freqs, times, spec = signal.spectrogram(audio,
                                    fs=sample_rate,
                                    window='hann',
                                    nperseg=nperseg,
                                    noverlap=noverlap,
                                    detrend=False)
    return freqs, times, 10 * np.log10(spec.T.astype(np.float32) + eps)
In [218]:
freqs, times, spectrogram = log_specgram(samples, rate)
In [219]:
def plot_magnitude_spectrogram(samples, rate, freqs, times, spectrogram):
    n_sample = len(samples)
    fig = plt.figure(figsize=(14, 8))
    ax1 = fig.add_subplot(211)
    ax1.set_title('Raw wave')
    ax1.set_ylabel('Magnitude')
    ax1.set_xlim(times.min(), times.max())
    ax1.plot(np.linspace(0, n_sample / rate, n_sample), samples, color='#ff7f00', linewidth=0.05)

    ax2 = fig.add_subplot(212)
    ax2.imshow(spectrogram.T, aspect='auto', origin='lower', 
               extent=[times.min(), times.max(), freqs.min(), freqs.max()], cmap='autumn_r')
    ax2.set_title('Spectrogram ')
    ax2.set_ylabel('Freqs in Hz')
    ax2.set_xlabel('Seconds')
    plt.show()

In [220]:
plot_magnitude_spectrogram(samples, rate, freqs, times, spectrogram)
- The signals need to be normalized in time and power
- Noise and silences could be removed
- It is difficult to compare different records with this kind of graph

Visualize the FFT of the sound

In [221]:
def custom_fft(y, fs):
    """
    Calculate the FFT of the samples y with a rate of fs
    """
    T = 1.0 / fs
    N = y.shape[0]
    yf = fft(y)
    xf = np.linspace(0.0, 1.0/(2.0*T), N//2)
    power = 10 * np.log10(2.0/N * np.abs(yf[0:N//2]))  # the FFT is symmetrical, so we keep just the first half
    # the FFT is also complex, so we take the magnitude (abs)
    return xf, power
In [222]:
def plot_fft(fourier, power):
    plt.figure(figsize=(14,6))
    plt.plot(fourier/1000, power, color='#ff7f00', linewidth=0.05)
    plt.xlabel('Frequency (kHz)', fontsize = 16)
    plt.ylabel('Power (dB)', fontsize = 16)
    plt.show()
In [223]:
fourier, power = custom_fft(samples, rate)

In [224]:
plot_fft(fourier, power)
Advantage of this graph: it does not depend on the length of the sound file.
We need at least three dimensions to characterize the sound file: **time, frequency and power**.

Create features step by step

Using the conclusions of T. Papadopoulos, S.J. Roberts, K. Willis, _Automated bird sound recognition in realistic settings_, Sep. 2018

Step 1: select the most energetic frames

"Namely, for each recording we identify the **1% most energetic frames** and we take the **mean energy** of those frames as an estimate of highest level for that particular recording. Following that, **we select all frames of that recording that have power of at least 0.25** of the estimated highest level as corresponding to bird sound and discard the remaining frames."
T Papadopoulos, S.J. Roberts, K. Willis, _Automated bird sound recognition in realistic settings_, Sep.2018
In [225]:
# mean of the 1% most energetic samples 
mean_high_NRJ_samples = np.mean(heapq.nlargest(n_samples//100, np.abs(samples)))
# select only samples of that recording that have power of at least 0.25 of the estimated highest level
samples = np.array([sample for i_sample, sample in enumerate(samples) if np.abs(sample) > 0.25 * mean_high_NRJ_samples])
# update nb frames
n_samples = len(samples)
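The per-sample loop above can be slow on long recordings. A vectorized sketch of the same selection rule (the helper name and its default arguments are my own, not from the paper) could look like this:

```python
import numpy as np

def select_energetic(samples, top_frac=0.01, level_frac=0.25):
    """Keep samples whose magnitude exceeds level_frac times the mean
    magnitude of the top_frac most energetic samples."""
    n_top = max(1, int(len(samples) * top_frac))
    # estimate of the highest level: mean of the top n_top magnitudes
    estimated_high = np.mean(np.sort(np.abs(samples))[-n_top:])
    return samples[np.abs(samples) > level_frac * estimated_high]
```

For very long signals, `np.partition` could replace the full sort.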
In [226]:
freqs, times, spectrogram = log_specgram(samples, rate)
fourier, power = custom_fft(samples, rate)

In [227]:
plot_magnitude_spectrogram(samples, rate, freqs, times, spectrogram)
plot_fft(fourier, power)
This step removes some noise and all silences.
Moreover, the length (and consequently the file size) is reduced, here from 120 s to 7 s.

Step 2: split into 20ms frames

"The features we extract are based on FFT-based spectrograms. The spectrogram of each of the training and test recording is computed with a **frame length of 20ms, overlap of 50%** and rectangular window."
T Papadopoulos, S.J. Roberts, K. Willis, _Automated bird sound recognition in realistic settings_, Sep.2018
The overlap preserves the time component

This is already done by the function log_specgram(audio, sample_rate, window_size=20, step_size=10, eps=1e-10)

In [228]:
window_size=20
step_size=10
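As a sanity check on these values (assuming the 44100 Hz rate standardized earlier), the millisecond parameters translate into sample counts exactly as in log_specgram:

```python
rate = 44100           # Hz, the standardized sampling rate
window_size = 20       # ms, frame length
step_size = 10         # ms, step between frames -> 50% overlap

nperseg = int(round(window_size * rate / 1e3))   # samples per frame
noverlap = int(round(step_size * rate / 1e3))    # overlapping samples
hop = nperseg - noverlap                         # new samples per frame

print(nperseg, noverlap, hop)  # 882 441 441
```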

Step 3: Keep only 1-10KHz

"In order to minimize the contribution of low frequency wind noise in the recordings, **only the bins corresponding to frequencies between 1kHz and 10kHz are kept**. Those frequency limits are within the range of what is typically used in existing bird recognition studies and slight changes around these values should not be expected to have a considerable effect on the system’s performance "
T Papadopoulos, S.J. Roberts, K. Willis, _Automated bird sound recognition in realistic settings_, Sep.2018
According to specialists, many bird songs have frequency ranges between 1 and 8 kHz, with maximum sensitivity between 1 and 5 kHz.
In [229]:
# keep only fq between 1kHz and 10 kHz
indices = [i for i, fq in enumerate(freqs) if fq>1000 and fq<10000]
freqs_red = freqs[indices]
spectrogram_red = spectrogram[:,indices]

Step 4: normalize

" Four types of frame-level features were considered, all choices from the various types described in Briggs et al. (2009) and Stowell & Plumbley (2014a). More specifically, after **normalising the absolute value of each frame’s spectrum to sum unity**, we computed the mean (denoted here as f mean ), standard deviation (f std ), mode (f mode ) and the difference between the mode of each selected frame and its successor in the original recording "
T Papadopoulos, S.J. Roberts, K. Willis, _Automated bird sound recognition in realistic settings_, Sep.2018
In [230]:
from sklearn.preprocessing import normalize
In [231]:
sequence_spectogram_normalized = normalize(spectrogram_red, norm="l1", axis=1)

In [232]:
plot_magnitude_spectrogram(samples, rate, freqs_red, times, sequence_spectogram_normalized)
NSTR (nothing significant to report)

Step 5: Concatenate and compute sequences of 100 frames

"the individual sequences of frame-level features from each **recording of the same species are lined-up in one sequence** (which preserves the original frame sequence in each audio recording but contains ‘cuts’ due to rejected frames and due to the line-up of the different recordings) and **sequences of features obtained from 100 frames** are taken."
T Papadopoulos, S.J. Roberts, K. Willis, _Automated bird sound recognition in realistic settings_, Sep.2018
Each sequence lasts about 1 s of sound (100 frames with a 10 ms step)
In [233]:
def get_sequences(spectrogram, l_sequence=100, tolerance=0.3):
    nb_frames = len(spectrogram)
    list_sequences = []
    nb_sequences = nb_frames // l_sequence

    if nb_sequences != 0:
        for j in range(nb_sequences):
            list_sequences.append(spectrogram[j * l_sequence: (j+1) * l_sequence])
    else:
        # we have less than n_frames_per_sequence in the frame...
        # overlap the last sequence if there is sufficient information
        if (nb_frames % l_sequence > l_sequence * tolerance):
            list_sequences.append(spectrogram[-l_sequence:])

    return list_sequences
In [234]:
list_sequences = get_sequences(sequence_spectogram_normalized)
In [235]:
sequence_spectogram_normalized = list_sequences[0]
In [236]:
def plot_spectrogram(freqs, times, spectrogram):
    fig = plt.figure(figsize=(14, 4))
    plt.imshow(spectrogram.T, extent=[0, 100, freqs.min(), freqs.max()], aspect='auto', origin='lower', cmap='autumn_r')
    plt.title('Spectrogram ')
    plt.ylabel('Freqs in Hz')
    plt.xlabel('Nb frames')
    plt.show()

In [237]:
plot_spectrogram(freqs_red, times, sequence_spectogram_normalized)
This sequence is now standardized and represents a good baseline.

Step 6: create features

" We consider three types of binned histograms:
  • two-dimensional histograms of the f_mean and f_std frame-level features with 100 and 50 bins spaced uniformly from 1kHz to 10kHz along the two respective dimensions (corresponding to 5000-dimensional feature vectors)
  • one-dimensional histograms of the f_mode feature with 100 bins spaced uniformly from 1kHz to 10kHz along the f_mode dimension (corresponding to 100-dimensional feature vectors)
  • two-dimensional histograms of the f_mode and Δf_mode pair of features with 100 bins spaced uniformly from 1kHz to 10kHz along the f_mode dimension and 50 bins spaced uniformly from -2kHz to 2kHz along the Δf_mode dimension (corresponding to 5000-dimensional feature vectors)."
T Papadopoulos, S.J. Roberts, K. Willis, _Automated bird sound recognition in realistic settings_, Sep.2018
$$ f_{mean}(i) = \int \text{spec}(f_i)\, f_i \, df_i$$
$$ f_{std}(i) = \sqrt{\int \text{spec}(f_i)\, (f_i - f_{mean}(i))^2\, df_i}$$
$$ f_{mode}(i) = f_i(F)\text{ , with } F = \arg\max \text{spec}(f_i)$$
$$ f_{\Delta mode}(i) = f_{mode}(i) - f_{mode}(i-1)$$
All are vectors of length 100 (the number of frames)
$f_{\Delta mode}$ preserves the temporal aspect
In [238]:
# Create edges for the histograms from e_min to e_max
def edges(e_min, e_max, nb_bins):
    return np.arange(e_min, e_max + (e_max-e_min) / nb_bins, (e_max-e_min) / nb_bins)

f_mean_edges = edges(1000, 10000, 100)
f_std_edges = edges(1000, 4000, 50)
f_mode_edges = edges(1000, 10000, 100)
f_delta_mode_edges = edges(-2000, 2000, 50)
In [239]:
n_frames_sequence = min(sequence_spectogram_normalized.shape[0], 100)
In [240]:
# calculate mean, std, mode and delta_mode
f_mean = [np.sum(freqs_red * sequence_spectogram_normalized[i]) for i in range(n_frames_sequence)]
f_std = [(np.abs(np.sum(sequence_spectogram_normalized[i] * ((freqs_red - f_mean[i]) ** 2) ))) ** (0.5) for i in range(n_frames_sequence)]
f_mode = [freqs_red[np.argmax(sequence_spectogram_normalized[i])] for i in range(n_frames_sequence)]
f_delta_mode = np.roll(f_mode, -1) - f_mode  # note: the last element wraps around to the first frame
In [241]:
plot_spectrogram(freqs_red, times, sequence_spectogram_normalized)
fig, axs = plt.subplots(1, 3, figsize=(14, 5))
axs[0].plot(range(len(f_mean)),f_mean)
axs[0].set_title("$f_{mean}$", fontsize=16)
axs[1].plot(range(len(f_std)),f_std)
axs[1].set_title("$f_{std}$", fontsize=16)
axs[2].plot(range(len(f_mode)),f_mode)
axs[2].set_title("$f_{mode}$", fontsize=16)
plt.show()
In [242]:
# create the histograms
sequence_histogram1, f_mean_edges, f_std_edges = np.histogram2d(f_mean, f_std, bins=(f_mean_edges, f_std_edges))
sequence_histogram1 = sequence_histogram1.T
sequence_histogram2, f_mode_edges = np.histogram(f_mode, bins=f_mode_edges)
sequence_histogram3, f_mode_edges, f_delta_mode_edges = np.histogram2d(f_mode, f_delta_mode, bins=(f_mode_edges, f_delta_mode_edges))
sequence_histogram3 = sequence_histogram3.T
In [243]:
def plot_features(category, sequence, freq_red,
                  f_mean_edges = edges(1000, 10000, 100),
                  f_std_edges = edges(1000, 4000, 50),
                  f_mode_edges = edges(1000, 10000, 100),
                  f_delta_mode_edges = edges(-2000, 2000, 50),
                  n_frames_per_sequence=100):
    """
    Plot the three binned feature histograms for one sequence.
    """
    l_sequence = len(sequence)
    if l_sequence < n_frames_per_sequence:
        n_frames_per_sequence = l_sequence
            
    f_mean = [np.sum(freq_red * sequence[i]) for i in range(n_frames_per_sequence)]
    f_std = [(np.abs(np.sum(sequence[i] * (freq_red - f_mean[i]) ** 2))) ** 0.5 for i in range(n_frames_per_sequence)]
    f_mode = [freq_red[np.argmax(sequence[i])] for i in range(n_frames_per_sequence)]
    f_delta_mode = np.roll(f_mode, -1) - f_mode

    H1, _, _ = np.histogram2d(f_mean, f_std, bins=(f_mean_edges, f_std_edges))
    H2, _ = np.histogram(f_mode, bins=f_mode_edges)
    H3, _, _ = np.histogram2d(f_mode, f_delta_mode, bins=(f_mode_edges, f_delta_mode_edges))
    
    fig, axs = plt.subplots(1, 3, figsize=(14, 5))
    axs[0].imshow(H1.T, interpolation='nearest', origin='lower', extent=[f_mean_edges[0], f_mean_edges[-1], f_std_edges[0], f_std_edges[-1]])
    axs[0].set_xlabel("$f_{mean}$", fontsize=16)
    axs[0].set_ylabel("$f_{std}$", fontsize=16)
    axs[0].set_xlim(f_mean_edges[0], f_mean_edges[-1])
    axs[0].set_ylim(f_std_edges[0], f_std_edges[-1])
    axs[0].set_aspect(3)

    axs[1].stem(f_mode_edges[:-1], H2)
    axs[1].set_xlabel("$f_{mode}$", fontsize=16)
    axs[1].set_ylabel("nb of values", fontsize=12)
    axs[1].set_xlim(f_mode_edges[0], f_mode_edges[-1])

    axs[2].imshow(H3.T, interpolation='nearest', origin='lower', extent=[f_mode_edges[0], f_mode_edges[-1], f_delta_mode_edges[0], f_delta_mode_edges[-1]])
    axs[2].set_xlabel("$f_{mode}$", fontsize=16)
    axs[2].set_ylabel(r"$f_{\Delta mode}$", fontsize=16)
    axs[2].set_xlim(f_mode_edges[0], f_mode_edges[-1])
    axs[2].set_ylim(f_delta_mode_edges[0], f_delta_mode_edges[-1])
    axs[2].yaxis.set_label_position("right")
    axs[2].set_aspect(2)

    plt.show()

In [244]:
plot_features(1, sequence_spectogram_normalized, freqs_red)
Averaging these histograms over several sequences will provide a good, standardized visualization.
In [245]:
features1 = sequence_histogram1.flatten()
features2 = sequence_histogram2
features3 = sequence_histogram3.flatten()

Functions

In [246]:
def get_spectrogram(file, window_size=20, step_size=10):
    """
    This function opens a wav file and returns the normalized spectrogram
    inputs : 
        str file : name of the sound file (wav)
        int window_size : length of a frame in ms
        int step_size : step between frames in ms (controls the overlap)
    output : 
        int category : category of the bird according to the name of the file
        array spectogram_normalized : the normalized spectrogram
        list freqs_red : list of frequencies
    """
    
    record_id, _ = os.path.splitext(file)
    bird_category = data_sounds.loc[data_sounds['File ID']==int(record_id)]['English_name'].values[0]
    category = cat_bird.categories.tolist().index(bird_category)
    
    rate, frames = wavfile.read("Sounds/Out/" + file)
    n_frames = len(frames)
    
    # if stereo, keep only one channel
    if frames.ndim == 2:
        frames = frames[:,0]
    
    # Step 1 : select the most energetic frames
    # mean of the 1% most energetic frames 
    mean_high_NRJ_frames = np.mean(heapq.nlargest(n_frames//100, np.abs(frames)))
    # select only frames of that recording that have power of at least 0.25 of the estimated highest level
    frames = np.array([frame for i_frame, frame in enumerate(frames) if np.abs(frame) > 0.25 * mean_high_NRJ_frames])
    # update nb frames
    n_frames = len(frames)
    
    # Step 2 : split into 20ms frames and get the spectrogram (uses the function arguments)
    freqs, times, spectrogram = log_specgram(frames, rate, window_size=window_size, step_size=step_size)
    
    # Step 3 : keep only 1-10KHz
    # keep only fq between 1kHz and 10 kHz
    indices = [i for i, fq in enumerate(freqs) if fq>1000 and fq<10000]   
    spectrogram_red = spectrogram[:,indices]
    freqs_red = freqs[indices]
    # Step 4: normalize
    spectogram_normalized = normalize(spectrogram_red, norm="l1", axis=1)
    
    return category, spectogram_normalized, freqs_red
In [247]:
def get_spectrogram_kasios(krecord_id, window_size=20, step_size=10):
    """
    This function opens a Kasios wav file and returns the normalized spectrogram
    inputs : 
        int krecord_id : ID of the Kasios record (name of the wav file)
        int window_size : length of a frame in ms
        int step_size : step between frames in ms (controls the overlap)
    output : 
        array spectogram_normalized : the normalized spectrogram
        list freqs_red : list of frequencies
    """

    rate, frames = wavfile.read("Sounds_Kasios/Out/" + str(krecord_id) + ".wav")
    n_frames = len(frames)
    
    # if stereo, keep only one channel
    if frames.ndim == 2:
        frames = frames[:,0]
    
    # Step 1 : select the most energetic frames
    # mean of the 1% most energetic frames 
    mean_high_NRJ_frames = np.mean(heapq.nlargest(n_frames//100, np.abs(frames)))
    # select only frames of that recording that have power of at least 0.25 of the estimated highest level
    frames = np.array([frame for i_frame, frame in enumerate(frames) if np.abs(frame) > 0.25 * mean_high_NRJ_frames])
    # update nb frames
    n_frames = len(frames)
    
    # Step 2 : split into 20ms frames and get the spectrogram (uses the function arguments)
    freqs, times, spectrogram = log_specgram(frames, rate, window_size=window_size, step_size=step_size)
    
    # Step 3 : keep only 1-10KHz
    # keep only fq between 1kHz and 10 kHz
    indices = [i for i, fq in enumerate(freqs) if fq>1000 and fq<10000]   
    spectrogram_red = spectrogram[:,indices]
    freqs_red = freqs[indices]
    # Step 4: normalize
    spectogram_normalized = normalize(spectrogram_red, norm="l1", axis=1)
    
    return spectogram_normalized, freqs_red
In [248]:
def get_sequences(spectrogram, l_sequence=100, tolerance=0.3):
    """
    split the spectrogram into spectrograms of size l_sequence (default 100)
        and keep the last one if its length is at least tolerance * l_sequence
            (overlapping with the previous one)
    """
    nb_frames = len(spectrogram)
    list_sequences = []
    nb_sequences = nb_frames // l_sequence

    if nb_sequences != 0:
        for j in range(nb_sequences):
            list_sequences.append(spectrogram[j * l_sequence: (j+1) * l_sequence])
    else:
        # we have less than n_frames_per_sequence in the frame...
        # overlap the last sequence if there is sufficient information
        if (nb_frames % l_sequence > l_sequence * tolerance):
            list_sequences.append(spectrogram[-l_sequence:])

    return list_sequences
In [249]:
def get_sequences_per_category(all_spectrograms, lenght_sequence=100, tolerance=0.3):
    """
    This function transforms a list of spectrograms into fixed-length sequences
    ---------------------------------------------------------------------------
    inputs:
        all_spectrograms : list of spectrograms listed by category
        lenght_sequence : number of frames per sequence
        tolerance : keep only sequences 
                        with a length of at least tolerance * lenght_sequence
    ---------------------------------------------------------------------------
    outputs:
        all_sequences : list of sequences listed by category
    """
    
    # concatenate all spectrograms per category
    list_unique_spec = [[] for _ in range(nb_categories)]
    for category in range(nb_categories):
        if len(all_spectrograms[category])==0:
            continue
        list_unique_spec[category] = np.vstack(all_spectrograms[category])
    
    all_sequences = [get_sequences(list_unique_spec[category],
                                           l_sequence=lenght_sequence,
                                           tolerance=tolerance) 
                     for category in range(nb_categories)]

    return all_sequences
In [250]:
def edges(e_min, e_max, nb_bins):
    return np.arange(e_min, e_max + (e_max-e_min) / nb_bins, (e_max-e_min) / nb_bins)
In [251]:
def get_features(category, sequence, freq_red,
                 f_mean_edges = edges(1000, 10000, 100),
                 f_std_edges = edges(1000, 4000, 50),
                 f_mode_edges = edges(1000, 10000, 100),
                 f_delta_mode_edges = edges(-2000, 2000, 50),
                 n_frames_per_sequence=100):
    """
    Compute the three binned histograms for one sequence and return them,
    flattened, with the category label prepended.
    """
    l_sequence = len(sequence)
    if l_sequence < n_frames_per_sequence:
        # pad short sequences by repeating the last frame
        pad = np.repeat(sequence[-1:], n_frames_per_sequence - l_sequence, axis=0)
        sequence = np.vstack([sequence, pad])

    f_mean = [np.sum(freq_red * sequence[i]) for i in range(n_frames_per_sequence)]
    f_std = [(np.abs(np.sum(sequence[i] * (freq_red - f_mean[i]) ** 2))) ** 0.5 for i in range(n_frames_per_sequence)]
    f_mode = [freq_red[np.argmax(sequence[i])] for i in range(n_frames_per_sequence)]
    f_delta_mode = np.roll(f_mode, -1) - f_mode

    sequence_histogram1, _, _ = np.histogram2d(f_mean, f_std, bins=(f_mean_edges, f_std_edges))
    features1 = sequence_histogram1.T.flatten()

    sequence_histogram2, _ = np.histogram(f_mode, bins=f_mode_edges)
    features2 = sequence_histogram2

    sequence_histogram3, _, _ = np.histogram2d(f_mode, f_delta_mode, bins=(f_mode_edges, f_delta_mode_edges))
    features3 = sequence_histogram3.T.flatten()
    
    return np.concatenate([[category], features1, features2, features3])
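With the default bin edges above, the returned vector has a fixed length that can be checked directly (this check is mine, reproducing the edges helper from above): 100x50 bins for the (f_mean, f_std) histogram, 100 bins for f_mode, 100x50 bins for (f_mode, delta f_mode), plus one slot for the category label.

```python
import numpy as np

def edges(e_min, e_max, nb_bins):
    # reproduced from above: nb_bins + 1 edges, hence nb_bins histogram bins
    return np.arange(e_min, e_max + (e_max - e_min) / nb_bins, (e_max - e_min) / nb_bins)

n_h1 = (len(edges(1000, 10000, 100)) - 1) * (len(edges(1000, 4000, 50)) - 1)   # 5000
n_h2 = len(edges(1000, 10000, 100)) - 1                                        # 100
n_h3 = (len(edges(1000, 10000, 100)) - 1) * (len(edges(-2000, 2000, 50)) - 1)  # 5000
n_features = 1 + n_h1 + n_h2 + n_h3
print(n_features)  # 10101, the width of the dataset built in the Scalability section
```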
In [252]:
def get_features2(category, sequence, freq_red,
                  n_frames_per_sequence=100):
    """
    Return the raw frame-level features (no histograms), with the
    category label prepended.
    """
    l_sequence = len(sequence)
    if l_sequence < n_frames_per_sequence:
        # pad short sequences by repeating the last frame
        pad = np.repeat(sequence[-1:], n_frames_per_sequence - l_sequence, axis=0)
        sequence = np.vstack([sequence, pad])

    f_mean = [np.sum(freq_red * sequence[i]) for i in range(n_frames_per_sequence)]
    f_std = [(np.abs(np.sum(sequence[i] * (freq_red - f_mean[i]) ** 2))) ** 0.5 for i in range(n_frames_per_sequence)]
    f_mode = [freq_red[np.argmax(sequence[i])] for i in range(n_frames_per_sequence)]
    f_delta_mode = np.roll(f_mode, -1) - f_mode

    return np.concatenate([[category], f_mean, f_std, f_mode, f_delta_mode])
In [253]:
def plot2_features(title, H1, H2, H3, freq_red,
                   f_mean_edges = edges(1000, 10000, 100),
                   f_std_edges = edges(1000, 4000, 50),
                   f_mode_edges = edges(1000, 10000, 100),
                   f_delta_mode_edges = edges(-2000, 2000, 50),
                   n_frames_per_sequence=100):
    """
    Plot the three binned feature histograms with labels and a title.
    """
    fig, axs = plt.subplots(1, 3, figsize=(14, 5))
    axs[0].imshow(H1.T, interpolation='nearest', origin='lower', extent=[f_mean_edges[0], f_mean_edges[-1], f_std_edges[0], f_std_edges[-1]])
    axs[0].set_xlabel("$f_{mean}$", fontsize=16)
    axs[0].set_ylabel("$f_{std}$", fontsize=16)
    axs[0].set_xlim(f_mean_edges[0], f_mean_edges[-1])
    axs[0].set_ylim(f_std_edges[0], f_std_edges[-1])
    axs[0].set_aspect(3)

    axs[1].stem(f_mode_edges[:-1], H2)
    axs[1].set_xlabel("$f_{mode}$", fontsize=16)
    axs[1].set_ylabel("nb of values", fontsize=12)
    axs[1].set_xlim(f_mode_edges[0], f_mode_edges[-1])

    axs[2].imshow(H3.T, interpolation='nearest', origin='lower', extent=[f_mode_edges[0], f_mode_edges[-1], f_delta_mode_edges[0], f_delta_mode_edges[-1]])
    axs[2].set_xlabel("$f_{mode}$", fontsize=16)
    axs[2].set_ylabel(r"$f_{\Delta mode}$", fontsize=16)
    axs[2].set_xlim(f_mode_edges[0], f_mode_edges[-1])
    axs[2].set_ylim(f_delta_mode_edges[0], f_delta_mode_edges[-1])
    axs[2].yaxis.set_label_position("right")
    axs[2].set_aspect(2)
     
    plt.suptitle(title, fontsize=24)
    plt.show()
In [254]:
def plot3_features(title, H1, H2, H3, freq_red, axs,
                   f_mean_edges = edges(1000, 10000, 100),
                   f_std_edges = edges(1000, 4000, 50),
                   f_mode_edges = edges(1000, 10000, 100),
                   f_delta_mode_edges = edges(-2000, 2000, 50),
                   n_frames_per_sequence=100):
    """
    Plot the three binned feature histograms on the provided axes (no labels).
    """
    axs[0].imshow(H1.T, interpolation='nearest', origin='lower', extent=[f_mean_edges[0], f_mean_edges[-1], f_std_edges[0], f_std_edges[-1]])
    axs[0].set_xlim(f_mean_edges[0], f_mean_edges[-1])
    axs[0].set_ylim(f_std_edges[0], f_std_edges[-1])
    axs[0].set_aspect(3)
    axs[1].stem(f_mode_edges[:-1], H2)
    axs[1].set_xlim(f_mode_edges[0], f_mode_edges[-1])
    axs[1].set_title(title, fontsize=16)
    #axs[1].set_aspect(2)
    axs[2].imshow(H3.T, interpolation='nearest', origin='lower', extent=[f_mode_edges[0], f_mode_edges[-1], f_delta_mode_edges[0], f_delta_mode_edges[-1]])
    axs[2].set_xlim(f_mode_edges[0], f_mode_edges[-1])
    axs[2].set_ylim(f_delta_mode_edges[0], f_delta_mode_edges[-1])
    axs[2].yaxis.set_label_position("right")
    axs[2].set_aspect(2)

Scalability

Create wav from mp3

Convert every mp3 in the In folder that corresponds to a good record into a wav file in the Out folder, resampled at 44100 Hz

f = IntProgress(min=0, max=len(data), description='Load files:', bar_style='success') # instantiate the bar
print("progress")
display(f) # display the bar
for file in os.listdir("Sounds/In/"):

    record_id = int(max(re.findall(r'\d+', file), key=len))
    # if the ID is in data_sounds, we apply 
    if record_id in data_sounds['File ID'].values:
        sound = AudioSegment.from_mp3("Sounds/In/"+file)
        sound = sound.set_frame_rate(44100)
        sound = sound.set_channels(1)
        sound.export("Sounds/Out/" + str(record_id) + ".wav", format="wav")
    f.value += 1
print("done !")

Create features

In [255]:
from multiprocessing import Pool
import multiprocessing
from tqdm import tqdm_notebook as tqdm
cores = multiprocessing.cpu_count()
pool = Pool(processes=cores)
file_list = list(os.listdir("Sounds/Out/"))

all_spectrograms = [[] for _ in range(nb_categories)]

for category, spectrogram, freqs_red in tqdm(pool.imap_unordered(get_spectrogram, file_list), total=len(file_list)):
    all_spectrograms[category].append(spectrogram)
# retrieve sequences from spectrograms
all_sequences = get_sequences_per_category(all_spectrograms)

Build the dataset of histogram features

# create the dataset from the sequences
dataset = pd.DataFrame(columns=range(10101))
n_rows = 0
for category in range(nb_categories):
    for j, sequence in enumerate(all_sequences[category]):
        dataset.loc[n_rows] = get_features(category, sequence, freqs_red)
        n_rows += 1
dataset.head()
# create the dataset
dataset.to_csv("dataset.csv", float_format='%.3f')
dataset2 = dataset.copy()

dataset2 = dataset2.astype(int)

dataset2.to_csv("dataset2.csv")

len(dataset)

Instead of executing previous lines of commands, just load the dataset.

In [256]:
dataset = pd.read_csv("dataset2.csv")
In [257]:
dataset = dataset.drop(dataset.columns[0], axis=1)  

Compare the different species

Using the mean of each feature per species...

In [258]:
dataset_category = dataset.groupby(['0']).mean()
In [259]:
dataset_category.insert(0, '0', 0.0)
In [260]:
categories  = []
features1 = []
features2 = []
features3 = []
for row in dataset_category.iterrows():
    index, data_i = row
    categories.append(index)
    features1.append(np.reshape(data_i.values[0:5000], (50, 100)).T)
    features2.append(data_i.values[5000:5100])
    features3.append(np.reshape(data_i.values[5100:10100], (50, 100)).T)
freqs = np.array(range(1050, 10000, 50))

Plot the features

In [261]:
for i, H1, H2, H3 in zip(categories, features1, features2, features3):
    plot2_features(cat_bird.categories.tolist()[i], H1, H2, H3, freqs)
The three histograms are useful for discriminating between bird species.
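As a reminder of how the first of these histograms is built (the actual `get_features` is defined earlier in the notebook), here is a minimal, self-contained sketch: a 2D count of frame-wise mean frequency against frame-wise frequency standard deviation, binned on a 100 × 50 grid mirroring `edges(1000, 10000, 100)` × `edges(1000, 4000, 50)`. The `f_mean`/`f_std` arrays below are synthetic placeholders, not real spectrogram statistics.

```python
import numpy as np

rng = np.random.default_rng(0)
# synthetic frame-wise statistics for one 100-frame sequence
f_mean = rng.uniform(1000, 10000, size=100)   # mean frequency per frame
f_std = rng.uniform(1000, 4000, size=100)     # frequency std per frame

# 100 x 50 bins, mirroring edges(1000, 10000, 100) and edges(1000, 4000, 50)
mean_edges = np.linspace(1000, 10000, 101)
std_edges = np.linspace(1000, 4000, 51)

H1, _, _ = np.histogram2d(f_mean, f_std, bins=[mean_edges, std_edges])
flat = H1.ravel()   # flattened, this fills the first 5000 columns of a feature row
```

Flattened, the 100 × 50 grid yields exactly the 5000 values that `np.reshape(data_i.values[0:5000], (50, 100)).T` recovers above.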

Similarity

In [262]:
from skimage.measure import compare_ssim
In [263]:
ssim_beetween = np.array([[np.mean([compare_ssim(features1[i], features1[j], full=True)[0], 
                                    compare_ssim(features2[i], features2[j], full=True)[0],
                                    compare_ssim(features3[i], features3[j], full=True)[0]])
                           for i in range(nb_categories)] 
                          for j in range(nb_categories)])
In [264]:
#sim2 = np.array([[np.mean([np.linalg.norm(features1[i]-features1[j], ord='nuc'), 
#                           np.linalg.norm(features2[i]-features2[j]), 
#                           np.linalg.norm(features3[i]-features3[j], ord='nuc')]) 
#                           for i in range(nb_categories)] 
#                          for j in range(nb_categories)])
In [265]:
from mpl_toolkits.axes_grid1 import make_axes_locatable
In [266]:
def plot_heatmap(data, row_labels, col_labels, title="",
                    cbar_kw={}, cbarlabel="", txt=False, **kwargs):
    """
    Plot a heatmap from a numpy array and two lists of labels.

    Arguments:
        data       : A 2D numpy array of shape (N,M)
        row_labels : A list or array of length N with the labels
                     for the rows
        col_labels : A list or array of length M with the labels
                     for the columns
    Optional arguments:
        cbar_kw    : A dictionary with arguments to
                     :meth:`matplotlib.Figure.colorbar`.
        cbarlabel  : The label for the colorbar
        txt        : If True, write the rounded value inside each cell
    All other arguments are directly passed on to the imshow call.
    """
    fig, ax = plt.subplots(figsize=(10,10))
    
    # Plot the heatmap
    im = ax.imshow(data, **kwargs)

    # Create colorbar
    divider = make_axes_locatable(ax)
    cax = divider.append_axes("right", size="5%", pad=0.1)
    
    cbar = ax.figure.colorbar(im, cax=cax, **cbar_kw)
    
    
    
    cbar.ax.set_ylabel(cbarlabel, rotation=-90, va="bottom")

    # We want to show all ticks...
    ax.set_xticks(np.arange(data.shape[1]))
    ax.set_yticks(np.arange(data.shape[0]))
    # ... and label them with the respective list entries.
    ax.set_xticklabels(col_labels)
    ax.set_yticklabels(row_labels)

    # Let the horizontal axes labeling appear on top.
    #ax.tick_params(top=True, bottom=False,
    #               labeltop=True, labelbottom=False)

    # Rotate the tick labels and set their alignment.
    plt.setp(ax.get_xticklabels(), rotation=45, ha="right",
             rotation_mode="anchor")
    
    # Loop over data dimensions and create text annotations.
    if txt==True:
        for i in range(data.shape[0]):
            for j in range(data.shape[1]):
                text = ax.text(j, i, np.round(data[i, j], 2),
                               ha="center", va="center", color="black")

    
    # Turn spines off and create white grid.
    for edge, spine in ax.spines.items():
        spine.set_visible(False)

    ax.set_xticks(np.arange(data.shape[1]+1)-.5, minor=True)
    ax.set_yticks(np.arange(data.shape[0]+1)-.5, minor=True)
    ax.grid(which="minor", color="w", linestyle='-', linewidth=3)
    ax.tick_params(which="minor", bottom=False, left=False)
    ax.set_title(title, fontsize=30)
    fig.tight_layout()
    plt.show()

In [267]:
plot_heatmap(ssim_beetween, cat_bird.categories.tolist(), cat_bird.categories.tolist(), "Similarity between species", cmap="YlGn", cbarlabel="ssim", txt=True)
This kind of graphic makes it much easier to compare the different sounds.

Graph of similarities

In [6]:
import networkx as nx
from networkx.readwrite import json_graph
import json
from IPython.display import display, HTML, Javascript
In [5]:
labels = {}
for i, cat in enumerate(cat_bird.categories.tolist()):
    labels[i]=cat
In [270]:
def create_graph(similarity, threshold):
    n = similarity.shape[0]
    G = nx.Graph()
    G.add_nodes_from([i for i in range(n)])
    for i in range(n):
        for j in range(i+1, n):
            if similarity[i,j]> threshold:
                G.add_edge(i, j, weight=( similarity[i,j] - threshold) * 100 ) 
    return G
In [271]:
G = create_graph(ssim_beetween, 0.75)

for ix in G.nodes():
    G.node[ix]['category'] = labels[ix]
    G.node[ix]['isKasios'] = 0

for ix,deg in G.degree():
    G.node[ix]['degree'] = deg
    G.node[ix]['parity'] = (1-deg%2)
    #G.node[ix]['katz'] = 0.1
for ix,katz in nx.katz_centrality(G).items():
    G.node[ix]['katz'] = katz
In [2]:
# create a json file
datajson = json_graph.node_link_data(G)
with open('graphsim.json', 'w') as f:
    json.dump(datajson, f, indent=4)
In [7]:
js_getResults = """<div id="d3-container"></div>

<style>

.node {stroke: #fff; stroke-width: 1.5px;}
.link {stroke: #999; stroke-opacity: .6;}

text {
  font: 14px sans-serif;
  pointer-events: none;
  text-shadow: 0 1px 0 #fff, 1px 0 0 #fff, 0 -1px 0 #fff, -1px 0 0 #fff;
}

</style>


<script>

// We load the latest version of d3.js from the Web.
require.config({paths: {d3: "https://d3js.org/d3.v3.min"}});
require(["d3"], function(d3) {
    
    // Parameter declaration, the height and width of our viz.
    var width = 800,
        height = 800;

    // Colour scale for node colours.
    var color = d3.scale.category10();

    // We create a force-directed dynamic graph layout.
    // D3 has number of layouts - refer to the documentation.
    var force = d3.layout.force()
        .charge(-1000)
        .linkDistance(150)
        .size([width, height]);

    // We select the < div> we created earlier and add an <svg> container.
    // SVG = Scalable Vector Graphics
    var svg = d3.select("#d3-container").select("svg")
    if (svg.empty()) {
        svg = d3.select("#d3-container").append("svg")
                    .attr("width", width)
                    .attr("height", height);
    }
        
    // We load the JSON network file.
    d3.json("graphsim.json", function(error, graph) {
        // Within this block, the network has been loaded
        // and stored in the 'graph' object.
        
        force.nodes(graph.nodes)
            .links(graph.links)
            .start();

        var link = svg.selectAll(".link")
            .data(graph.links)
            .enter().append("line")
            .attr('stroke-width', function(d) { return d.weight; })
            .attr("class", "link");

        var node = svg.selectAll(".node")
            .data(graph.nodes)
            .enter().append("g")
            .attr("class", "node")
            .call(force.drag)
            .on('click', connectedNodes);
        node.append("circle")
            .attr("r", 15)  // radius
            .style("fill", function(d) {return  d.id==16 ?'#0652DD':'#009432';})
            
       
        node.append("text")
            .attr("dx", function(d){return d.isKasios==0 ? 10 : -5;})
            .attr("dy", ".15em")
            .text(function(d) { return d.category ;})
            .attr("stroke", "black");
            
        node.append("title")
            .text(function(d) { return d.category ;});
       
        force.on("tick", function() {
            link.attr("x1", function(d) { return d.source.x; })
                .attr("y1", function(d) { return d.source.y; })
                .attr("x2", function(d) { return d.target.x; })
                .attr("y2", function(d) { return d.target.y; });
            node.attr("cx", function(d) { return d.x; })
                .attr("cy", function(d) { return d.y; });
                d3.selectAll("circle").attr("cx", function (d) {
                    return d.x;
                })
                    .attr("cy", function (d) {
                    return d.y;
                });
                d3.selectAll("text").attr("x", function (d) {
                    return d.x;
                })
                    .attr("y", function (d) {
                    return d.y;
                });
            });
        
        var toggle = 0;


        var linkedByIndex = {};
        for (var i = 0; i < graph.nodes.length; i++) {
            linkedByIndex[i + "," + i] = 1;
        };
        graph.links.forEach(function (d) {
            linkedByIndex[d.source.index + "," + d.target.index] = 1;
        });
        function neighboring(a, b) {
            return linkedByIndex[a.index + "," + b.index];
        }
     
        function connectedNodes() {
            if (toggle == 0) {
                var d = d3.select(this).node().__data__;
                
                link.style("opacity", function (o) {
                    return d.index==o.source.index | d.index==o.target.index ? 1 : 0.8;
                });
                link.style("stroke-width", function (o) {
                    return d.index==o.source.index | d.index==o.target.index ?  o.weight : 0.8;
                });
                node.style("opacity", function (o) {
                    return neighboring(d, o) | neighboring(o, d) ? 1 : 0.3;
                });
                //Reset the toggle.
                toggle = 1;
            } else {
                //Restore everything back to normal
                node.style("opacity", 1);
                link.style("opacity", 1);
                link.style("stroke-width",function(d) { return  d.weight; });
                toggle = 0;
            }
        } 

        
    });
});
</script>
"""

In [8]:
display(HTML(js_getResults))

Kasios records

f = IntProgress(min=0, max=15, description='Load files:', bar_style='success') # instantiate the bar
display(f) # display the bar
for file in os.listdir("Sounds_Kasios/In/"):

    record_id = int(max(re.findall(r'\d+', file), key=len))
    # if the ID is in data_sounds, we apply 
    sound = AudioSegment.from_mp3("Sounds_Kasios/In/"+file)
    sound = sound.set_frame_rate(44100)
    sound = sound.set_channels(1)
    sound.export("Sounds_Kasios/Out/" + str(record_id) + ".wav", format="wav")
    f.value += 1
nb_kasios = f.value
list_kasios_sequences = [[] for i in range(nb_kasios)]
for i in range(nb_kasios):
    t_spectogram_normalized, t_freqs_red = get_spectrogram_kasios(i+1)
    list_kasios_sequences[i].extend(get_sequences(t_spectogram_normalized))


# create the dataset from the sequences
dataset_kasios = pd.DataFrame(columns=range(10101))
n_rows = 0
for id in range(nb_kasios):
    for _, sequence in enumerate(list_kasios_sequences[id]):
        dataset_kasios.loc[n_rows] = get_features(id, sequence, t_freqs_red)
        n_rows += 1
dataset_kasios.to_csv("dataset_kasios.csv", float_format='%.3f')
In [275]:
dataset_kasios = pd.read_csv("dataset_kasios.csv")
In [276]:
dataset_kasios = dataset_kasios.drop(dataset_kasios.columns[0], axis=1)  
In [277]:
kasios_mean = dataset_kasios.groupby(['0']).mean()
In [278]:
kasios_mean.insert(0, '0', 0.0)
In [279]:
l_kasios_id  = []
features1k = []
features2k = []
features3k = []
for row in kasios_mean.iterrows():
    index, data_i = row
    l_kasios_id.append(index)
    features1k.append(np.reshape(data_i.values[0:5000], (50, 100)).T)
    features2k.append(data_i.values[5000:5100])
    features3k.append(np.reshape(data_i.values[5100:10100], (50, 100)).T)
freqs = np.array(range(1050, 10000, 50))

In [280]:
for i, H1, H2, H3 in zip(l_kasios_id, features1k, features2k, features3k):
    plot2_features("Kasios record #" + str(int(i+1)), H1, H2, H3, freqs)
In [281]:
kasios_label = ["Kasios record #" + str(int(i+1)) for i in range(nb_kasios)]
In [282]:
ssim_k = np.array([[np.mean([compare_ssim(features1[i], features1k[j], full=True)[0], 
                             compare_ssim(features2[i], features2k[j], full=True)[0],
                             compare_ssim(features3[i], features3k[j], full=True)[0]])
                           for i in range(nb_categories)] 
                          for j in range(nb_kasios)])
In [283]:
plot_heatmap(ssim_k, kasios_label, cat_bird.categories.tolist(), 
                "Similarity between Kasios birds and species", 
                cmap="YlGn", cbarlabel="ssim")

Classification

In [284]:
from sklearn.neighbors import KNeighborsClassifier
In [285]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn import metrics
In [286]:
# dataset with all 10100 features
dataset3 = dataset.copy()
y = dataset3['0']
X = dataset3.drop(dataset3.columns[0], axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
In [287]:
scaler1 = StandardScaler()
scaler1.fit(X_train)
X_train = scaler1.transform(X_train)
X_test = scaler1.transform(X_test)
In [290]:
kasios_id = dataset_kasios['0']
X_kasios = dataset_kasios.drop(dataset_kasios.columns[0], axis=1)
X_kasios = scaler1.transform(X_kasios)

Multi-class Classification

In [291]:
from sklearn.multiclass import OneVsRestClassifier
from sklearn.ensemble import RandomForestClassifier

In [292]:
random_forest = OneVsRestClassifier(RandomForestClassifier(n_estimators=100, random_state=1))
random_forest.fit(X_train, y_train)
Out[292]:
OneVsRestClassifier(estimator=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=None,
            oob_score=False, random_state=1, verbose=0, warm_start=False),
          n_jobs=None)
In [293]:
random_forest.score(X_test, y_test)
Out[293]:
0.5107142857142857
An overall accuracy of 0.51 over 19 classes is well above chance, but how does it break down per species?
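The normalized confusion matrix below gives the detail; per-species recall can also be computed directly (sklearn's `classification_report` would do the same). A minimal numpy-only sketch, where `y_true` and `y_pred` are toy stand-ins for `y_test` and `random_forest.predict(X_test)`:

```python
import numpy as np

def per_class_recall(y_true, y_pred, n_classes):
    """Fraction of each class's samples that were predicted correctly."""
    recall = np.zeros(n_classes)
    for c in range(n_classes):
        mask = (y_true == c)
        if mask.any():
            recall[c] = np.mean(y_pred[mask] == c)
    return recall

# toy labels standing in for y_test and the random-forest predictions
y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 0])
print(per_class_recall(y_true, y_pred, 3))  # class-wise recall: 0.5, 1.0, 0.5
```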
In [294]:
import scikitplot as skplt
skplt.metrics.plot_confusion_matrix(y_test, random_forest.predict(X_test), normalize=True, figsize=(16,9))
Out[294]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fdda53ac5c0>
In [295]:
probas = random_forest.predict_proba(X_kasios)
In [296]:
i = 0
probas_kasios = []
for j in range(nb_kasios):
    probas_kasios.append(np.mean(np.array([probas[i+k] for k in range(kasios_id.value_counts()[j])]), axis=0))
    i+=kasios_id.value_counts()[j]
probas_kasios = np.array(probas_kasios)
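The loop above walks through `probas` with `value_counts`; an equivalent and arguably clearer formulation groups the probability rows by record id with pandas. A small self-contained sketch, with synthetic stand-ins for `probas` and `kasios_id`:

```python
import numpy as np
import pandas as pd

# synthetic stand-ins: 5 sequences from 2 Kasios records, 3 species
probas = np.array([[0.2, 0.5, 0.3],
                   [0.4, 0.3, 0.3],
                   [0.1, 0.1, 0.8],
                   [0.3, 0.3, 0.4],
                   [0.5, 0.2, 0.3]])
record_id = pd.Series([0, 0, 1, 1, 1])

# mean class probability per record; rows come out ordered by record id
probas_per_record = pd.DataFrame(probas).groupby(record_id).mean().values
```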
In [297]:
plot_heatmap(probas_kasios, kasios_label, cat_bird.categories.tolist(), "Probability that each Kasios record belongs to each species (10100 features)", cmap="YlGn", cbarlabel="probability")
At first glance, we have little certainty about the Kasios bird categories (except perhaps for K10, K11 and K12). Let's observe these results with a graph.

Visualization on a graph

In [299]:
probas_kasios_labels = []
for k in range(nb_kasios):
    label = ""
    best_probas = heapq.nlargest(5, zip(np.round(probas_kasios[k,:],2), cat_bird.categories.tolist()))
    for i, (proba, category) in enumerate(best_probas):
        label += "P" + str(i+1) + ": " + str(proba) + " - " + category + "\n"
    probas_kasios_labels.append(label)
In [300]:
def create_graph2(similarity, probas, threshold1=0.75, threshold_probas=0.1):
    (p, n) = probas.shape
    G = nx.Graph()
    for i in range(n+p):
        G.add_node(i)
    for i in range(n):
        for j in range(i+1, n):
            if (similarity[i,j] > threshold1):
                G.add_edge(i, j, weight= (similarity[i,j] - threshold1) * 100, stype = 0)
    for j in range(p):
        for i in range(n):
            if (probas[j, i] > threshold_probas):
                G.add_edge(n + j, i, weight = 2 * np.sqrt((probas[j, i] - threshold_probas) * 100), stype = 1)
    return G
In [301]:
G2 = create_graph2(ssim_beetween, probas_kasios)

for ix in G2.nodes():
    if ix < nb_categories:
        G2.node[ix]['category'] = cat_bird.categories.tolist()[ix]
        G2.node[ix]['tip'] = cat_bird.categories.tolist()[ix]
        G2.node[ix]['isKasios'] = 0
    else:
        G2.node[ix]['category'] = 'K' + str(ix - nb_categories + 1)
        G2.node[ix]['tip'] = probas_kasios_labels[ix - nb_categories]
        G2.node[ix]['isKasios'] = 1
for ix,deg in G2.degree():
    G2.node[ix]['degree'] = deg
    G2.node[ix]['parity'] = (1-deg%2)
    #G.node[ix]['katz'] = 0.1
for ix,katz in nx.katz_centrality(G2).items():
    G2.node[ix]['katz'] = katz
In [302]:
# create a json file
datajson2 = json_graph.node_link_data(G2)
with open('graphsim2.json', 'w') as f:
    json.dump(datajson2, f, indent=4)
In [24]:
js_getResults2 = """<div id="d3-container2"></div>

<style>

.node {stroke: #fff; stroke-width: 1.5px;}

text {
  font: 12px sans-serif;
  pointer-events: none;
  
}

</style>


<script>

// We load the latest version of d3.js from the Web.
require.config({paths: {d3: "https://d3js.org/d3.v3.min"}});
require(["d3"], function(d3) {
    
    // Parameter declaration, the height and width of our viz.
    var width = 800,
        height = 800;


    // We create a force-directed dynamic graph layout.
    // D3 has number of layouts - refer to the documentation.
    var force = d3.layout.force()
        .charge(-1000)
        .linkDistance(125)
        .size([width, height]);
    
    
    // We select the < div> we created earlier and add an <svg> container.
    // SVG = Scalable Vector Graphics
    var svg = d3.select("#d3-container2").select("svg")
    if (svg.empty()) {
        svg = d3.select("#d3-container2").append("svg")
                    .attr("width", width)
                    .attr("height", height)
    }
        
    // We load the JSON network file.
    d3.json("graphsim2.json", function(error, graph) {
        // Within this block, the network has been loaded
        // and stored in the 'graph' object.

        
        force.nodes(graph.nodes)
            .links(graph.links)
            .start();
        
        var link = svg.selectAll(".link")
            .data(graph.links)
            .enter().append("line")
            .attr('stroke-width', function(d) { return d.weight})
            .attr("class", "link")
            .style("stroke", function(d) { return d.stype==0 ? '#3e2723': '#212121'});

        var node = svg.selectAll(".node")
            .data(graph.nodes)
            .enter().append("g")
            .attr("class", "node")
            .call(force.drag)
            .on('click', connectedNodes);
        node.append("circle")
            .attr("r", 15)  // radius
            .style("fill", function(d) {return  colornode(d.id)})
       
        node.append("text")
            .attr("dx", function(d){return d.isKasios==0 ? 10 : -7;})
            .attr("dy", ".15em")
            .text(function(d) { return d.category ;})
            .attr("stroke", "black");
            
        node.append("title")
            .text(function(d) { return d.tip ;});

        force.on("tick", function() {
            link.attr("x1", function(d) { return d.source.x; })
                .attr("y1", function(d) { return d.source.y; })
                .attr("x2", function(d) { return d.target.x; })
                .attr("y2", function(d) { return d.target.y; });
            node.attr("cx", function(d) { return d.x; })
                .attr("cy", function(d) { return d.y; });
                d3.selectAll("circle").attr("cx", function (d) {
                    return d.x;
                })
                    .attr("cy", function (d) {
                    return d.y;
                });
                d3.selectAll("text").attr("x", function (d) {
                    return d.x;
                })
                    .attr("y", function (d) {
                    return d.y;
                });
            });
        
        
        function colornode(a) {
            if (a == 16)
                {return '#0652DD';}
            // test the specific ids before the generic Kasios test (a > 18),
            // otherwise these branches are unreachable
            if (a == 19 || a == 24 || a == 29 || a == 33)
                {return '#FF9800';}
            if (a > 18)
                {return '#FFC107';}
            return '#009432';
        }
        
        var toggle = 0;

        var linkedByIndex = {};
        for (var i = 0; i < graph.nodes.length; i++) {
            linkedByIndex[i + "," + i] = 1;
        };
        graph.links.forEach(function (d) {
            linkedByIndex[d.source.index + "," + d.target.index] = 1;
        });
        function neighboring(a, b) {
            return linkedByIndex[a.index + "," + b.index];
        }
        
        function connectedNodes() {
            if (toggle == 0) {
                var d = d3.select(this).node().__data__;
                
                link.style("opacity", function (o) {
                    return d.index==o.source.index | d.index==o.target.index ? 1 : 0.8;
                });
                link.style("stroke-width", function (o) {
                    return d.index==o.source.index | d.index==o.target.index ? o.weight : 0.8;
                });
                node.style("opacity", function (o) {
                    return neighboring(d, o) | neighboring(o, d) ? 1 : 0.3;
                });
                //Reset the toggle.
                toggle = 1;
            } else {
                //Restore everything back to normal
                node.style("opacity", 1);
                link.style("opacity", 1);
                link.style("stroke-width",function(d) { return d.weight; });
                toggle = 0;
            }
        } 

        
    });
});
</script>
"""

In [25]:
display(HTML(js_getResults2))
With a threshold of 0.1 on the probabilities, K9, K13, K2, K8 and maybe K1 could be Rose-crested Blue Pipits. The others have only a very small probability of belonging to this species.
In [305]:
list_kasios_bp = [i for i in range(nb_kasios) if probas_kasios[i,16]>0.1]

Top-3

In [306]:
print("--------------3 MOST PROBABLE SPECIES BY KASIOS RECORD------------------------")
print("******************************************************************************")
for j in range(nb_kasios):
    bestprobas = heapq.nlargest(3, zip(probas_kasios[j,:], cat_bird.categories.tolist()))
    print("Kasios record #", j+1, " : ")
    for i, bestproba in enumerate(bestprobas):
        print("      {} : {} , proba {}".format(i+1, bestproba[1], bestproba[0]))
--------------3 MOST PROBABLE SPECIES BY KASIOS RECORD------------------------
******************************************************************************
Kasios record # 1  : 
      1 : Orange Pine Plover , proba 0.2325603567771147
      2 : Rose-crested Blue Pipit , proba 0.16442669980066366
      3 : Vermillion Trillian , proba 0.10882016278833276
Kasios record # 2  : 
      1 : Rose-crested Blue Pipit , proba 0.1828847858081056
      2 : Ordinary Snape , proba 0.16586969109055938
      3 : Eastern Corn Skeet , proba 0.14791976605304016
Kasios record # 3  : 
      1 : Bombadil , proba 0.16420298016867563
      2 : Orange Pine Plover , proba 0.15570455155746507
      3 : Ordinary Snape , proba 0.09036977364045032
Kasios record # 4  : 
      1 : Bombadil , proba 0.2408088235294117
      2 : Darkwing Sparrow , proba 0.1798713235294117
      3 : Lesser Birchbeere , proba 0.0842830882352941
Kasios record # 5  : 
      1 : Orange Pine Plover , proba 0.2610922436459246
      2 : Blue-collared Zipper , proba 0.15312773882559158
      3 : Scrawny Jay , proba 0.1280400964066608
Kasios record # 6  : 
      1 : Green-tipped Scarlet Pipit , proba 0.27609917788936794
      2 : Darkwing Sparrow , proba 0.2494766266157198
      3 : Bombadil , proba 0.0863967946339839
Kasios record # 7  : 
      1 : Scrawny Jay , proba 0.187830153950836
      2 : Orange Pine Plover , proba 0.17386492199104678
      3 : Canadian Cootamum , proba 0.11826407557926134
Kasios record # 8  : 
      1 : Rose-crested Blue Pipit , proba 0.16334455667788997
      2 : Ordinary Snape , proba 0.1615039281705948
      3 : Eastern Corn Skeet , proba 0.14051627384960716
Kasios record # 9  : 
      1 : Rose-crested Blue Pipit , proba 0.2394643764956788
      2 : Eastern Corn Skeet , proba 0.18324932733201899
      3 : Queenscoat , proba 0.13955148234944015
Kasios record # 10  : 
      1 : Orange Pine Plover , proba 0.6426664296506779
      2 : Vermillion Trillian , proba 0.13722955772870143
      3 : Lesser Birchbeere , proba 0.04267468332518959
Kasios record # 11  : 
      1 : Queenscoat , proba 0.35293442086051535
      2 : Bombadil , proba 0.1088479543367278
      3 : Rose-crested Blue Pipit , proba 0.0989970223681994
Kasios record # 12  : 
      1 : Orange Pine Plover , proba 0.3993375683588326
      2 : Broad-winged Jojo , proba 0.09702526040192931
      3 : Vermillion Trillian , proba 0.08161612805286411
Kasios record # 13  : 
      1 : Rose-crested Blue Pipit , proba 0.19191919191919193
      2 : Ordinary Snape , proba 0.17171717171717174
      3 : Queenscoat , proba 0.10101010101010102
Kasios record # 14  : 
      1 : Darkwing Sparrow , proba 0.27003932108787027
      2 : Lesser Birchbeere , proba 0.2017344855143598
      3 : Eastern Corn Skeet , proba 0.13489953025497747
Kasios record # 15  : 
      1 : Queenscoat , proba 0.1530037976729153
      2 : Orange Pine Plover , proba 0.14012403038138332
      3 : Bombadil , proba 0.1109008026287438

Comparison of record features with predicted-species features

In [307]:
from scipy.stats import mode
In [308]:
predict = random_forest.predict(X_kasios)

i = 0
predict_kasios = []
for j in range(nb_kasios):
    predict_kasios.append(mode(np.array([predict[i+k] for k in range(kasios_id.value_counts()[j])]), axis=None))
    i+=kasios_id.value_counts()[j]
predict_kasios = np.array(predict_kasios).T[0][0].astype(int)
In [309]:
def plot_comparaison_predict(prediction):
    fig, axs = plt.subplots(len(l_kasios_id), 9, figsize=(18,40))
    plt.subplots_adjust(hspace = 0.5)
    for i, H1, H2, H3 in zip(l_kasios_id, features1k, features2k, features3k):
        title_bp = ""
        if i in list_kasios_bp:
            title_bp = ", maybe a Blue Pipit"
        for j in range(9):
            axs[int(i),j].axis('off')
        plot3_features(cat_bird.categories.tolist()[i_bp], features1[i_bp], features2[i_bp], features3[i_bp], freqs, axs[int(i),0:3])    
        plot3_features("K#" + str(int(i + 1)) + title_bp, H1, H2, H3, freqs, axs[int(i),3:6])
        plot3_features("predicted : K#" + str(int(i + 1)) + "=" + cat_bird.categories.tolist()[prediction[int(i)]], features1[prediction[int(i)]], 
                       features2[prediction[int(i)]], features3[prediction[int(i)]], freqs, axs[int(i),6:9])

In [310]:
plot_comparaison_predict(predict_kasios)
The prediction is not too bad with these features. Without ML classification methods, we would be unable to identify the species by eye.

Dimensionality Reduction (inconclusive)

t-SNE

In [128]:
from sklearn import manifold
In [129]:
tsne = manifold.TSNE(n_components=2, init='random',
                     random_state=0, perplexity=100)
In [130]:
Y = tsne.fit_transform(X)
In [131]:
fig = plt.figure(figsize=(18, 18))
plt.scatter(Y[:, 0], Y[:, 1],marker='.' , c=y, cmap='tab20', label=y)
plt.title('t-SNE projection of the feature vectors')
#plt.legend(loc=2, scatterpoints=1)
plt.show()

UMAP

In [132]:
import umap
In [133]:
embedding = umap.UMAP(n_neighbors=25,
                      min_dist=0.3,
                      metric='correlation').fit_transform(X)
In [134]:
fig = plt.figure(figsize=(18, 18))
plt.scatter(embedding[:, 0], embedding[:, 1],marker='.' , c=y, cmap='tab20', label=y)
plt.title('UMAP projection of the feature vectors')
#plt.legend(loc=2, scatterpoints=1)
plt.show()

Conclusion

In [311]:
plotmap(list_kasios_bp)
We can assume that records #2, #9, #8 and #13 are Blue Pipits, and probably #1 too (even though the most probable species for #1 is the Orange Pine Plover).
But the other Kasios records seem to come from other species, so the sincerity of Kasios has to be challenged.

We cannot support Kasios's claim that "there are plenty of Rose-crested Blue Pipits happily living and nesting in the Preserve". We can only conclude that at least a small Blue Pipit population remains in the Preserve, and that it most probably does not nest near the dumping area.
We indeed observed that the number of Blue Pipits recorded in the vicinity of the dumping site has decreased in recent years.

Machine learning is powerful, but Mitch Vogel's human expertise remains essential.